Feature Selection: An Exploration of Algorithm Performance
Jason Case, Abhishek Dugar, Daniel Nkuah, Khoa Tran
2024-07-30
Introduction
What is feature selection? Why do we care?
Introduction: Early Techniques
- Forward, backward, and stepwise variable selection in linear models
- Univariate screening procedures
Introduction: Modern Techniques
Advanced feature selection methods:
- Similarity-based
- Information-theoretic
- Sparse-learning-based
- Statistical-based
Classification of methods
- Filter
- Wrapper
- Embedded
- Hybrid
Methods: Correlation-Based Feature Selection (CFS)
- Filter method
- Uses the correlation coefficient to measure each variable's relationship with the target, independently of the other variables.
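A minimal sketch of this kind of univariate correlation screening, assuming Pearson correlation on synthetic data (the threshold of 0.2 and the data itself are illustrative, not from the analysis in this deck):

```python
import numpy as np

def correlation_filter(X, y, threshold=0.2):
    """Keep features whose absolute Pearson correlation with the
    target exceeds `threshold` (a hypothetical cutoff)."""
    corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.where(np.abs(corrs) > threshold)[0]

# Toy data: feature 0 drives the label, features 2-4 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
selected = correlation_filter(X, y, threshold=0.2)
```

Because each feature is scored against the target on its own, this is cheap to compute but blind to interactions between features.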
Methods: Recursive Feature Elimination (RFE)
- Wrapper method
- Removes variables iteratively.
Steps:
- Train the classifier
- Compute the ranking criterion for all features
- Remove the features with the smallest ranking values
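The steps above map directly onto scikit-learn's `RFE`; a sketch on synthetic data (the 20-feature dataset and target subset size of 5 are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 20 features, 5 of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# RFE repeats the three steps: fit the classifier, rank features by
# coefficient magnitude, drop the lowest-ranked, and refit
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1)
rfe.fit(X, y)
selected = [i for i, keep in enumerate(rfe.support_) if keep]
```

Because the model is retrained at every elimination step, RFE is far more expensive than a filter method, but it accounts for how features behave together inside the classifier.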
Methods: Least Absolute Shrinkage and Selection Operator (LASSO)
- Embedded method
- Uses L1 regularization to shrink some coefficients exactly to zero, removing those variables from the model.
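A sketch of the embedded approach using L1-penalized logistic regression with the penalty strength chosen by cross-validation, as the modeling section describes (the synthetic data and grid of 10 penalty values are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# L1 penalty drives uninformative coefficients to exactly zero;
# Cs=10 tries 10 inverse-penalty values via 5-fold cross-validation
lasso = LogisticRegressionCV(Cs=10, penalty="l1", solver="liblinear", cv=5)
lasso.fit(X, y)

# Selection happens as a side effect of fitting: keep nonzero coefficients
n_selected = int(np.sum(lasso.coef_ != 0))
```

Unlike filters and wrappers, no separate selection step exists here: the surviving variables fall out of the fitted model itself.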
Methods: CFS & RFE
- Hybrid method
- Combines filter and wrapper methods
- Step 1: Filter using the correlation coefficient (i.e., trim the low-hanging fruit)
- Step 2: Remove the remaining variables iteratively using RFE
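The two steps can be chained as below: a correlation filter discards the weakest 20% of features (the same fraction used in the modeling section), then RFE runs on the survivors. The synthetic data and final subset size of 5 are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

# Step 1 (filter): drop the 20% of features least correlated with the target
corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
n_keep = int(0.8 * X.shape[1])
keep = np.argsort(corrs)[::-1][:n_keep]

# Step 2 (wrapper): run RFE only on the surviving features
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X[:, keep], y)
selected = keep[rfe.support_]  # indices in the original feature space
```

The cheap filter shrinks the search space so the expensive wrapper has fewer elimination rounds to run.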
Analysis: Spambase Dataset
- 4601 instances of emails
- 57 features for classification tasks
- Binary classification: email is spam (1) or not (0)
- 80/20 train/test split
- Increased number of features by adding all two-way interactions.
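The interaction expansion can be reproduced with scikit-learn's `PolynomialFeatures`; starting from the 57 Spambase features, all two-way interactions yield the 1653 columns reported for the baseline model (the random matrix below stands in for the real data):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder for the Spambase design matrix: 57 original features
X = np.random.default_rng(0).normal(size=(10, 57))

# interaction_only=True adds every pairwise product but no squared terms;
# include_bias=False drops the constant column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_expanded = poly.fit_transform(X)

# 57 originals + C(57, 2) = 1596 interactions = 1653 columns
n_features = X_expanded.shape[1]
```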
Analysis: COVID-19 NLP Text Classification Dataset
- 45k tweets related to COVID-19, labeled for sentiment analysis.
- Five sentiment classes, ranging from extremely positive to extremely negative
- Recoded to a binary classification task, positive (1) or negative (0)
- 33,444 (91%) training and 3,179 (9%) testing records after recoding
“Bag of words”
- text data converted into a matrix of word frequencies
- each row represents a document
- each column represents a unique word from the entire corpus
- Large, sparse feature set
Statistical Modeling
Three metrics:
- Accuracy on the test set
- Difference between accuracy on the training and test sets (overtraining)
- Number of variables selected (model complexity)
- Baseline: full logistic model with no feature selection
- CFS: select the best of 20 correlation thresholds using cross-validation
- RFE: select the best of 20 subset sizes using cross-validation
- LASSO: select the best penalty term using cross-validation
- CFS + RFE: remove the 20% of variables with the lowest correlation, then select the best of 20 subset sizes using cross-validation
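The three metrics for any one fitted model can be computed as below; the synthetic data and plain logistic baseline are stand-ins for the actual datasets and tuned models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# 80/20 train/test split, matching the setup used for Spambase
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
test_acc = model.score(X_te, y_te)             # metric 1: test accuracy
gap = model.score(X_tr, y_tr) - test_acc       # metric 2: overtraining
n_vars = X_tr.shape[1]                         # metric 3: model complexity
```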
Results: Spambase Dataset
| Method | # Variables | Test Accuracy | Train-Test Gap |
| --- | --- | --- | --- |
| Baseline | 1653 | 0.915 | 0.070 |
| CFS | 126 | 0.936 | -0.004 |
| RFE | 1144 | 0.911 | 0.067 |
| LASSO | 50 | 0.903 | -0.006 |
| CFS + RFE | 683 | 0.920 | 0.045 |
Results: COVID-19 NLP Text Classification Dataset
| Method | # Variables | Test Accuracy | Train-Test Gap |
| --- | --- | --- | --- |
| Baseline | 4820 | 0.846 | 0.097 |
| CFS | 1449 | 0.868 | 0.042 |
| RFE | 2980 | 0.863 | 0.059 |
| LASSO | 4472 | 0.880 | 0.045 |
| CFS + RFE | 2672 | 0.778 | 0.088 |
Results: Records per Feature